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Current lexica and machine learning based sentiment analysis approaches 
still suffer from a two-fold limitation. First, manual lexicon construction and 
machine training is time consuming and error-prone. Second, the 
prediction’s accuracy entails sentences and their corresponding training text 
should fall under the same domain. In this article, we experimentally 
evaluate four sentiment classifiers, namely support vector machines (SVMs), 
Naive Bayes (NB), logistic regression (LR) and random forest (RF). We 
quantify the quality of each of these models using three real-world datasets 
that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic 
movie reviews. Specifically, we study the impact of a variety of natural 
language processing (NLP) pipelines on the quality of the predicted 
sentiment orientations. Additionally, we measure the impact of incorporating 
lexical semantic knowledge captured by WordNet on expanding original 
words in sentences. Findings demonstrate that the utilizing different NLP 
pipelines and semantic relationships impacts the quality of the sentiment 


analyzers. In particular, results indicate that coupling lemmatization and 
knowledge-based n-gram features proved to produce higher accuracy results. 
With this coupling, the accuracy of the SVM classifier has improved to 
90.43%, while it was 86.83%, 90.11%, 86.20%, respectively using the three 
other classifiers. 
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1. INTRODUCTION 

Sentiment analysis (SA) has become one of the most reliable tools for assisting organizations better 
understand the perception of their users about the products and services that they offer [1], [2]. Exploiting SA 
techniques in real-world application domains also serves individuals who are interested in learning more 
about the various customer perceptions on the products and/or services of interest. Recently, there has been a 
growing number of SA techniques which can be characterized by a number of strengths and weaknesses; 
demonstrated by the quality of their prediction accuracy. Lexicon-based and machine learning approaches 
have been among the most common techniques for this purpose. As far as lexicon-based approaches are 
concerned, a lexicon containing a set of words with their polarities, such as SentiWordNet is employed to 
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predict the sentiment orientation of word mentions in sentences. However, depending on a lexicon alone is 
not sufficient due to the fact that a word's prior polarity doesn't reflect its contextual polarity in a sentence 
[3]. In addition, various semantic relations and axioms should be incorporated; requiring to explicitly add 
such entities. On the other hand, machine learning methods rely on training samples to predict the polarity of 
opinions. This normally involves supervised learning, wherein a group of sentences is classified into several 
labelled classes, such as positive, neutral and negative. Although this approach can detect polarity of 
sentences; however, it is time-consuming and can only analyze the sentiments of texts which belong to the 
same field of the training samples. In addition, the size of the training samples should be large enough in 
order for this approach to yield good results. Starting from this position, we study the main features that 
characterize these techniques and evaluate their effectiveness using large-scale real-world datasets that 
comprise sentiment reviews. We focus on the feature engineering process and explore the impact of using 
semantic and taxonomic relationships, as well as different priorities for word types (Noun (NN), Adjective 
(ADJ), Adverb (ADV), Verb (VV)) to perform sentence-level SA. The main contributions of our proposition 
are summarized: 
— Explore existing SA techniques and evaluate their effectiveness, as well as efficiency using three 
publicly-available datasets. 
— Exploiting knowledge triplets encoded in generic knowledge repositories and measuring their impact on 
the quality of the utilized SA models. 

The rest of this paper is organized into four sections. In section 2, we introduce the research method 
and discuss the related works. We also provide the theoretical details and algorithms used in the proposed 
natural language processing (NLP). In section 3, we introduce the experimental evaluation steps that we 
carried out to evaluate a variety of SA pipelines. We also compare our proposed SA model with state-of-the- 
art techniques and discuss the findings accordingly. In section 4, we present the conclusions and highlight the 
future directions of our proposed work. 


2. RESEARCH METHOD 

Sentiment Analysis can be defined as the practice of employing NLP and text processing techniques 
to identify the sentiment orientation of text [4]. Researchers exploit various NLP techniques; coupling text 
analysis, in addition to using extrinsic resources to assist in identifying the semantic orientation of sentences. 
Among these techniques are: (i) Lexicon-based and (ii) Machine learning approaches [5]-[10], [11]-[16]. For 
instance, in [5], the authors compared between support vector machine (SVM) and artificial neural networks 
(ANNs). Both models were utilized to classify the polarity of texts at document level. As reported by the 
authors, the SVM has demonstrated to be less effective than ANNs, especially when using unbalanced data 
contexts. Nevertheless, experiments showed some limitations, specifically in terms of the high computational 
cost required for processing and analyzing large-size documents. Therefore, as reported in Reference [6], 
working on document level is expensive and inaccurate in some scenarios. This is mainly because the whole 
document is treated as a single unit and the entire content of the document is classified as positive or negative 
towards a specific issue. This problem also appears when working at the paragraph level, as a paragraph may 
contain two sentences with different sentiment polarities. Therefore, other researchers have focused on SA at 
sentence level [7], [8]. In Reference [7], Pak and Paroubek analyzed a dataset of financial reports at sentence 
level using multiple sentence embedding models. They used models based on Siamese CBow and fastText. 
However, as reported by Basha and Rajput in [9], using a lexicon with predefined term polarities doesn’t 
reflect any contextual sensitive polarities at the sentence level. To address this issue, MeSkelé and Frasincar 
proposed a hybrid model called ALDONA [10], that combines lexicon based and machine learning 
approaches. The results showed more accurate results than state-of-the-art models, namely when dealing with 
sentences with complex structures. In the same line of research, we can also find some researchers who 
analyzed the sentiment polarity at a word level [11]. For example, Chen ef al. utilized the reinforcement 
learning (RL) method and combined it with a long-short term memory (LSTM) model to extract word 
polarities. This approach succeeded in correctly determining the polarity of the word in most cases but it also 
failed in some scenarios. When exploring the related works, we can find out that main factor that has a major 
impact on the quality of SA results is the feature selection as reposted in [12], where the authors compared 
three different features selection methods, namely feature frequency (FF), term frequency — inverse 
document frequency (TF-IDF), and feature presence (FP). The Chi-square method was utilized to extract and 
filter the most relevant features, and the SVM classifier was used to evaluate the system’s performance. 
Results confirmed that the SVM classifier’s performance varied significantly depending on input features. In 
a similar work proposed by Alam and Yao [13], the authors developed a methodology for comparing the 
performance of different machine learning classifiers (SVM, Naive Bayes (NB), and maximum entropy 
(MaxE)). The authors used different NLP techniques (stemming, stopwords removal and emoticons removal) 
to extract significant features. They also used sets of n-gram token types (Uni-grams, Bi-grams, Uni-grams 
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and Bi-grams, Uni-gram with part of speech tags), in addition to using Word2vec technique for word 
vectorization, in an attempt to increase the classification accuracy. The results confirmed that the NLP 
techniques had a significant impact on the quality of the employed sentiment analyzers. Results confirmed 
that the NB classifier has outperformed the SVM and MaxE classifiers, however, it is not clear which 
combination of features should be used in what scenarios i.e., sentiment classification tasks. In a recent work 
proposed in Reference [14] by Sohrabi and Hemmatian, Fatemeh, the authors used a dataset of social media 
posts obtained from Twitter to conduct a comparative study between several machine learning SA techniques 
(decision tree, neural network, SVM, and W-Bayes network) at sentence level. The authors used both the 
RapidMiner software package and Python programing language to make this comparison. In the proposed 
methodology, researchers used a series of NLP techniques, including normalization, removing stopwords, 
and tokenization. For feature weighting, the authors used Word2vec and TF-IDF methods. The results 
confirmed significant variations in the accuracy of the utilized SAs after using the proposed NLP techniques. 
In [15], Krouska proposed a methodology to apply SA on three well known datasets, namely Obama-McCain 
debate (OMD), health care reform (HCR) and Stanford twitter sentiment gold standard (STS-Gold). 
Researchers compared several classification techniques using a variety of NLP techniques, including 
removing stop words, stimming, and n-gram tokenization. Results confirmed that n-grams captured by 
extrinsic resources has proved to improve the quality of the SAs. 


2.1. Proposed pipeline and algorithm 

The process of identifying the sentiment orientation of a given sentence passes through several 
phases. The first phase starts with data collection, cleaning and preparation. The second phase aim at feature 
selection and extraction from processes texts. The third phase focuses on training the sentiment classifier and 
testing its quality. In the next sections, we introduce the details of each of these phases. 


2.1.1. Data acquisition and cleansing 

Sources of sentiment sentences can vary on the Web. Twitter and other social media websites are 
among the most commonly referred to sources [2]. For experimental evaluation purposes, we use the well- 
known internet movie database (IMDB) movie reviews dataset, which is publicly available at Kaggle. In 
addition, we collected the 10,662 LightSide dataset, and 300 other generic movie reviews from Twitter. The 
first task after the data acquisition is to clean the raw data as it normally contains some special characters, 
including hashtags, consecutive white spaces, URLs, and unnecessary symbols. In addition, there is a set of 
emoticons that we cleaned using a pre-defined set of icon representations. After this step, we proceed to the 
second phase of data processing, that is tokenization and feature extraction. 


2.1.2. Tokenization and feature extraction 

At this stage, a set of features is extracted from textual information encoded in sentiment sentences. 
As we discussed in the literature review section, there are several SA models that employ different feature 
extraction techniques. For instance, considering the n-gram tokenization process, choosing a specific size for 
n-gram tokens affects the overall’s accuracy of the sentiment analyzer, especially when dealing with 
sentences contain negations, such as "not good" or "not very good" as reported in [2], [16]. In this context, 
we can see that using unigram features to process a sentence with “not good”, will separate the word "good" 
from the negation word "not". Similarly, using the same bigram features in the second scenario, the negation 
word "not" and the word "good" will be separated into two different tokens, these are: “not very” and “very 
good”. In terms of computational cost, if the dictionary size when using unigrams is |D], it will become |D|2 
when using bigrams, and |D|3 for trigrams. Therefore, using n-grams with n greater than 3 is very time 
consuming. In an earlier study in 2004, Pang and Lee has showed that unigram features are more significant 
than bigrams when performing emotional classification for movie reviews [17]. On the other hand, other 
studies showed that coupling bigrams and trigrams based on extrinsic semantic resources provided better 
results than unigrams alone [2], [18]. As reported by Maree and Eleyat in [2], incorporating high-order n- 
grams such as trigrams that can be acquired based on external knowledge resources captures a sentence 
contextual information more precisely than other unigram or bigram-based tokenizers. Accordingly, and in 
light of these conclusions, it is crucial to experimentally evaluate the utilization of n-gram tokenization as 
part of the large-scale SA process. In the next sections, we provide details concerning each of the proposed 
NLP pipeline phase. 


2.1.3. Stopwords removal, word stemming and lemmatization 

Raw text of sentiment sentences usually contains a lot of words that are frequently repeated. Such 
words have no significant contribution in determining the sentiment orientation of sentences. Moreover, they 
may have a negative impact on determining the polarity of sentences as reported in [3]. These words are 
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referred to as stopwords. Removing such words is a crucial step in the data pre-processing phase. Some 
researchers divided stopwords into two types, these are: 1) general stopwords and 2) domain-specific 
stopwords [16]. General stopwords such as (the, a, on, of, ...) are considered as stopwords regardless of the 
domain of the dataset. On the other hand, domain-specific stopwords are those words that appear to be of 
little significance in deriving the meaning of a given sentence in a particular domain [19]. Such words can be 
identified through the utilization of the tf.idf model in the same manner as described in [19]. For example, as 
far as the movie reviews dataset is concerned, words such as, movie, actor, and film appear very frequently in 
the sentiment sentences. These words are specific stopwords to this field and can be regarded as stopwords 
when they appear redundantly in sentences [18]. Stemming, on the other hand, is one of the common 
morphological analysis processes that is mainly concerned with the removal of derivational affixes in the 
hope of achieving a common base form for a given word. The goal of this process is to reduce inflectional 
forms and achieve a common base form for words in sentiment sentences. As reported by Haddi et al. the 
importance of this step lies in reducing text dimensions for sentiment classifiers [12]. In this context, the 
number of dimensions representing different words in the text will be reduced. As such, this allows 
representing a word, in addition to its inflectional forms, as one word. For instances, the words: product, 
producer, production and produce will be reduced to produce [6]. This reduction in word dimensions helps 
also to correctly determine the weights of the words and their importance in the text. In our work, we have 
exploited and evaluated two of the most common stemming techniques that are employed in the context of 
sentiment analysis. These are: Porter [20], [21] and Snowball stemmers. Porter stemmer is one of the oldest 
and most popular stemming algorithms that is still utilized as part of various sentiment analysis pipelines 
[22], [23]. It is a rule-based stemmer that consists of five phases of word reductions that are applied 
sequentially. Each rule group is applied to the longest suffix. One of main limitations of this stemmer is that 
it may return the same form for two words with different sentiment polarities. For instance, the word 
"Dependability" which has a positive polarity and the word "Dependent" which has a negative polarity are 
both stemmed to "Depend". Similarly, Snowball stemmer, which was developed based on Porter's stemmer, 
shares many of the limitations that are inherent in Porter stemmer. However, this technique has demonstrated 
to produce more promising stemming results against those produced by Porter. In addition, as reported by 
Rao in [24], the performance of Snowball is higher than Porter and it can be applied on many languages other 
than English. Similar to stemming, lemmatization is another common morphological analysis process that 
has been widely utilized for sentiment analysis purposes. It is mainly used to remove inflectional endings 
only and to return the base or dictionary form of a word, which is known as the lemma. One of the main 
advantages of lemmatization is that a returned lemma can be further processed to obtain synonyms, as well as 
other semantically-related terms from extrinsic semantic resources such as WordNet and Yago [25]. 


2.2. Features selection and sentiment classification techniques 

Several research works have focused on feature selection from text, namely for sentiment analysis 
purposes [2], [10], [26], [27]. The main features that have captured the focus of these works are term 
frequency, inverse document frequency, n-gram based tokens, and POS-tags of words extracted from the text 
of sentiment sentences. The aim of the feature selection process in this context is to reduce feature vectors in 
an attempt to improve the computation speed and increase the accuracy of the sentiment classification 
methods. However, little attention has been given to the latent semantic aspects of words encoded in 
sentiment sentences [28], [29]. In addition, the composition of the best features that significantly contribute 
to producing the most accurate sentiment classes has not been given sufficient attention. Accordingly, in the 
context of our research work, we attempt to investigate the impact on combining multiple features, including 
semantically-related features, on the quality of a number of machines learning based sentiment classification 
methods. In particular, we employ the well-known TF-IDF feature weighting technique with the combination 
of n-gram tokens and semantic features extracted from WordNet. As such, a sentiment sentence is assigned a 
weight based on (1). The unit we use to represent each sentence is D. We use the variable D to denote 
sentences in the dataset to maintain consistency with the general terms used to define the TF-IDF equation. 


TF-IDF(t) = (1+ log10 TF(t)) * log10 (N/DF(t)), (1) 


Where: 
—  N:is the number of sentiment sentences in the corpus. 
— DF: represents the number of sentences that contain term t. 
— TF: is the number of occurrences of term t in sentence D. 

It is important to point out here that the TF-IDF feature weighting technique is applied on composite 
features extracted from pre-processed text of sentiment sentences. A wide range of machine learning 
(supervised, semi-supervised and un-supervised) techniques have been developed and utilized for sentiment 
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analysis purposes [29], [30]. The input to any of the utilized methods is a set of composite features generated 
based on the techniques discussed in the previous section. As such, the quality of the utilized sentiment 
classifier will be highly dependent on the quality of the selected features, especially when dealing with highly 
skewed datasets. Traditionally, there are various types of classifiers that have been commonly used for 
sentiment analysis. These are: 1) SVM, NB, random forest (RF) and logistic regression (LR). It is important 
to point out that these classifiers are considered among the most commonly used and robust classification 
methods that have been successfully applied for sentiment analysis as reported in Reference [6]. They have 
demonstrated highly accurate classification of sentiment sentences in various application domains [28]. In 
addition, they tend to be less affected by noisy terms, especially when compared with ANN-based sentiment 
classifiers when the data imbalance increases. Furthermore, the time required to train an ANN is usually 
much higher than the that required for training these types of classifiers. 

Figure | shows the phases of the proposed sentiment analysis pipeline. First, we read the dataset 
containing sentences associated with their sentiment classes. Then, we perform data cleansing by removing 
special characters, unwanted signs, case folding and removing numbers. After this step, we apply text 
tokenization using two types of tokenizers Wordpunct_Tokenizer and Casual_Tokenizer. Then, the produced 
tokens are stemmed using two types of stemmers, Porter stemmer and Snowball stemmer. Lemmatization is 
also another step that we apply on the tokens using WordNet lemmatizer, WordNet knowledge-based n-gram 
recognition, and synonyms-based enrichment techniques, noting that we use synonyms of nouns and 
adjectives for term enrichment purposes. 


Sentiment sentences with their sentiment classes 


Data Cleansing 


Remove Stopwords Stemming Lemmatization Knowledge-based n-gram recognition Synonyms-based enrichment 


Porter Snowball stemmer Noun synonems Adjective synonyms 


Remove Stopwords Stemming Lemmatization Knowledge-based n-gram recognition Synonyms-based enrichment 


Porter Snowball stemmer Noun synonyms Adjective synonyms 


E | E 2 


TF-IDF vectonzation 


Feature Vectors 


Testing and validation 


Figure |. Phases of the proposed sentiment analysis pipeline 


Then, in the next step, we select and activate a hybrid set of features for tf-idf vectorization. We use 
the produced vectorization results for training each of the sentiment classifiers. The details of these step are 
presented in Algorithms 1. First, using Algorithm1, we perform data cleansing (steps 2 to 6). Step 10 presents 


Int J Artif Intell, Vol. 12, No. 1, March 2023: 284-294 


Int J Artif Intell ISSN: 2252-8938 o 289 


the tokenization function. In step 11, we invoke the feature selection and activation function. Using this 
function, we select and activate specific feature extraction functions, such as stemming, lemmatization, and 
n-gram recognition. to compose the hybrid features set. In step 13, we submit the composite features for tf-idf 
vectorization. Next, at step 18 we train the classifier and at step 19 we the algorithm carries out the testing 
and validation task and calculates accuracy of the results. 


Algorithm 1. Identifying the sentiment orientation algorithm 

INPUT: Sentiment sentences with their polarity. S < Stopwords >. And list of unwanted signs (%,#,%, 
OUTPUT: list of sentiment orientation predictions for test set reviews and algorithm accuracy. 
PREDICTSENTIMENT(R,S,L) : sentiment orientation, algorithm accuracy 

: reviews < R <r, p;> // list of sentiment sentences with their sentiment polarity 

: while reviews # NIL do: 

: for each word in r; do 

: do DELETE (word) if word € (L or numbers) . 

: word <- CASEFOLDING (word) 

end for 

:F<—<>//F: list of features 

: while Next r; in reviews do 

:T —<>//T: list of tokens for 1; 

0: T — TOKENIZE (1) // tokenization using Wordpunct_Tokenize or Casual_Tokenize 
//Next function (Step 11) constructs composite feature using the helper functions that start at step 21. 
: F — FEATURESELECTIONANDACTIVATION(T) 

: class — < p;> 

: vectoriser — TFIDFVECTORIZER(F) 

: TrainSet < 0,70 x(vectoriser, class) //Split vectors in to TrainSet and TestSet 

: TestSet — REST (vectoriser, class) 

: trained classifer— TRAINCLASSIFER(TrainSet) 

P «<>//P list of sentiment orientation predictions for test set reviews 

: P < trained classifer. predict (TestSet) 

: accuracy <CALCULATEACCURICY (P, TestSet) 

: return P, accuracy 


), L < unwanted signs > 


NY 


Next, for identifying the sentiment orientation we SentiWordNet lexicon. In particular, we use this 
lexicon to find SentiWordNet scores for every token in each review. We accumulate positive scores and 
negative scores per token. We use the accumulated scores to determine sentence sentiment orientations and 
produce predictions for the used datasets associated with their accuracy metrics. Our goal of using 
SentiWordNet in this context is to find out how a sentiment classifier’s accuracy can be affected by the 
incorporation of prior term polarity information. 


3. RESULTS AND DISCUSSION 
To perform the experiments and develop the proposed NLP pipeline, we used Python program 
language. This language was also utilized for pre-processing the three publicly-available datasets that are 


presented in Table 1. Before proceeding with the discussion on the conducted experiments, we present some 
details about the used datasets. 


Table 1. Statistics about the used sentiment review datasets 


s Sentiment ,... . _ Average Unigrams Average Unigrams & Bi- Average Unigrams without 
Patel Sentences Fever Neeauve Per Review grams Per Review Stopwords Per Review 
IMDB 50.000 25.000 25.000 270.82 261.4352 148.5876 
Se nieny Salenery iam 10.662 5.331 5.3331 2.71264 21.09316 13.75654 
Reference [2] 
LightSide’s Movie Reviews 3000 150 150 698.5533 674.4867 396.12 


In our experiments, we classified the reviews using four machine learning algorithms. These are: 
SVMs, NB, LR, and RF classifiers. And we have used accuracy metric to compare between these classifiers. 
Accuracy in this context is the ratio of correctly classified reviews to the total reviews. (2) explains how to 
calculate accuracy, where a sentiment prediction is classified into true positive (TP), true negative (TN), false 
positive (FP) and false negative (FN). 


number of correctly classified reviews tp+tn 
accuracy= = 


= 2 
number of reviews tp+tn+fp+fn ( ) 
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We also used the natural language toolkit (NLTK) to apply some NLP techniques. Using the NLTK 
library, we employed two types of tokenizers namely: wordpunct and casual tokenizers. We have also used uni- 
grams features and n-grams features (uni-grams and bi-grams). Moreover, we have updated NLTK stopword list 
to filter out sentences. It is important to point out that we removed all negation words from the predefined 
stopwords list. This is because such words have an important impact on identified correct sentiment 
orientations. We also enriched the list with additional words that appear frequently in the datasets (domain- 
specific stopwords). Example of such words are: story, movies, watching, acting, character, and film. 
Accordingly, we exploited a number of NLP pipelines to study the impact of each pipeline phases on the quality 
of each classifier. Tables 2-3 illustrate the variations on the accuracy results when utilizing each pipeline. 


Table 2. Experimental results using Wordpunct_Tokenize 


Datasets SVM NB LG RF 
IMDB 90.27% 85.96% 89.84% 84.42% 
Original Text Sentiment_sentences 76.52% 77.61% 75.52% 68.23% 
Movie Review 70.00% 52.22% 67.78% 75.56 
IMDB 90.20% 85.87% 89.71% 84.38% 
Remove unwanted symbols and numbers Sentiment_sentences 76.64% 77.55% 75.45% 68.95% 
Movie Review TLAL% = =52.22% 67.78% 68.89% 
IMDB 89.81% 86.61% 89.77% 86.22% 
Stopwords Removal Sentiment_sentences 74.64% 76.27% 74.68% 68.11% 
Movie Review 64.44% 56.67% 61.11% 68.89% 
IMDB 89.95% 85.25% 89.61% 84.08% 
Porter stemmer Sentiment_sentences 77.49% T7A1% 76.17% 69.92% 
Movie Review 70.00% 52.22% 63.33% 72.22% 
IMDB 89.51% 85.66% 89.37% 85.37% 
Porter stemmer + Stopwords Removal Sentiment_sentences 75.23% 76.11% 74.36% 69.45% 
Movie Review 64.44% 56.67% 64.44% 65.56% 
IMDB 89.90% 85.24% 89.60% 84.61% 
Snowball stemmer Sentiment_sentences 78.08% 77.64% 76.49% 70.33% 
Movie Review TLAL% = 52.22% 65.56% 75.56% 
IMDB 89.44% 85.57% 89.18% 85.12% 
Snowball stemmer + Stopwords Removal Sentiment_sentences 76.14% 76.64% 75.20% 69.54% 
Movie Review 63.33% 57.78% 63.33% 58.89% 
IMDB 90.07% 85.63% 89.74% 84.40% 
Lemmatizer Sentiment_sentences 76.83% 77.08% 75.52% 69.14% 
Movie Review TLAL% = =S1.A1% =§=—66.67% ~=—- 72.22% 
IMDB 89.63% 86.37% 89.51% 85.85% 
Lemmatizer + Stopwords Removal Sentiment_sentences 74.89% 76.17% 74.14% 68.51% 
Movie Review 66.67% 55.56% 64.44% 56.67% 
IMDB 90.39% 86.01% 89.97% 84.88% 
Uni-gram &Bi-gram Sentiment_sentences 77.11% 77.55% 75.08% 68.36% 
Movie Review TLAL% = 52.22% 65.56% 67.78% 
IMDB 90.17% 86.80% 90.02% 86.27% 
Uni-gram &Bi-gram Stopwords Removal Sentiment_sentences 74.86% 76.95% 74.73% 68.76% 
Movie Review 64.44% 56.67% 61.11% 73.33% 
IMDB 90.15% 85.51% 89.79% 84.30% 
Uni-gram &Bi-gram + Porter stemmer Sentiment_sentences 77.20% 77.24% 76.05% 69.95% 
Movie Review 68.89%  52.222% 66.67% 71.11% 
IMDB 90.09% 85.35% 89.73% 84.68% 
Uni-gram &Bi-gram + Snowball stemmer Sentiment_sentences 77.49% 77.77% 76.49% 69.45% 
Movie Review 68.89% 52.22% 66.67% 77.78% 
IMDB 90.43% 85.99% 89.95% 84.75% 
Uni-gram &Bi-gram + Lemmatizer Sentiment_sentences 76.67% 77.39% 74.92% 69.04% 
Movie Review 70.00% S1.11% 64.44% 72.22% 
IMDB 90.19% 86.59% 89.98% 84.45% 
Uni-gram + adj synonyms Sentiment_sentences 76.45% 77.17% 76.08% 69.23% 
Movie Review 72.22% 54.44% 70.00% 73.33% 
IMDB 90.07% 84.75% 89.24% 84.01 
Uni-gram + Noun synonyms Sentiment_sentences 74.80% 75.30% 73.89% 69.48% 
Movie Review 64.44% 52.22% 62.22% 74.44% 
IMDB 89.99% 84.99% 89.33% 84.28% 
Uni-gram, Bi-grams +Noun synonyms + lemmatization Sentiment_sentences 74.67% 75.39% 73.42% 68.76% 
Movie Review 65.56% S1.11% 65.56% 71.11% 
IMDB 90.30% 86.83% 90.11% 84.95% 
Uni-gram, Bi-grams +Adj synonems + lemmatization Sentiment_sentences 76.08% 77.55% 75.30% 68.54% 
Movie Review 74.44% 51.11% 68.89% 65.56% 
IMDB 90.33% 86.77% 90.01% 85.15% 
Uni-gram, Bi-grams + Adj synonyms Sentiment_sentences 76.42% 77.05% 75.92% 68.89% 
Movie Review 73.33% 52.22% 67.78% 82.22% 


Int J Artif Intell, Vol. 12, No. 1, March 2023: 284-294 


Int J Artif Intell ISSN: 2252-8938 ) 291 


Table 3. Experiments results using lexicon-based sentiment analysis 


Dataset SentiWordNet | Wordpunct Tokenize Casual Tokenize 
IMDB 65.00% 90.40% 90.50% 
Sentiment Sentences 58.00% 78.08% 77.64% 
Movie Review 59.00% 82.22% 75.56% 


In our experiments we have used 19 different combinations of NLP techniques. Each combination 
was applied twice, using Wordpunct tokenizers. Each combination of NLP techniques produced different set 
of features. Which means different results for the sentiment reviews polarity. Based on the results in 
Tables 2-3, we notice that all classifiers are positively affected when we used n-grams, snowball stemmer, 
sentence enrichment with adjective synonyms, lemmatization, and when we removed unwanted signs and 
numbers. Also, when we have a combination of them. This means that these NLP techniques have an 
important and positive impact in determining the polarity of reviews. Also, we note that when we do 
synonyms enrichment to adjectives, we obtain better results than when perform noun-based enrichment. 
When we used the Wordpunct_tokenizer with the IMDB dataset, we obtained the best result (90.43%). In 
particular, when we used SVM classifier with uni-grams and bi-grams with lemmatization. Both the NB and 
LR classifiers produced (86.83%) and (90.11%) accuracy results when we used n-grams feature with 
sentence enrichment by Adj synonyms and lemmatization. The RF classifier produced the best result 
(86.20%) when we used n-grams with stopwords removal. Considering the Sentiment Sentences dataset that 
is used by the authors of [2], we obtained the best result (78.08%) when using the SVM classifier with 
Snowball stemmer. Also, the RF classifier produced the best predictions (70.33%) when we used Snowball 
stemmer. As far as the NB and LR classifiers are concerned, they produced the best results (77.77%) and 
(76.49%), respectively when we using n-grams with Snowball stemmer. For the LightSide’s Movie Reviews 
dataset, we obtained the best result (82.22%) when we used the RF classifier with n_grams and Adjective 
synonyms combination. Also, the SVM classifier produced the best result (74.44%) when we used n_grams, 
sentence enrichment with Adjective synonyms and lemmatization. The NB classifier produced the best result 
(57.78%) when using Snowball stemmer with Stopwords removal. While the LR produced best result 
(70.00%) when we used sentence enrichment with Adjective synonyms. When using the Casual_tokenizer, 
we find that for the IMDB dataset, we got the best result (90.50%) with the SVM classifier when applied on 
the original text. On the other hand, the NB classifier produced the best result (86.91%) when using n-gram 
features with sentence enrichment by Adj synonyms and lemmatization. Both the LR and RF classifiers 
produced the best results (90.06%) and (86.13%) when we used n-grams with stopwords removal. In 
Sentiment_sentences dataset We got the best result (77.64%) when with the SVM classifier when employing 
the Snowball stemmer. Also, both the NB and LR classifiers produced the best results (77.58%) and 
(76.52%), respectively when we used Snowball stemmer. For the RF classifier, it produced the best result 
(69.95%) when we used the Snowball stemmer with Stopwords removal. For the LightSide’s 
Movie_Reviews dataset, we obtained the best result (75.56%) when using the RF classifier with n_grams and 
sentence enrichment with Adjective synonyms. Also, the LR classifier produced the best result (70.00%) 
when we used n_grams with sentence enrichment with Adjective synonyms. On the other hand, the NB 
classifier produced the best result (58.89%) when we used Stopwords removal. While the SVM classifier 
produced the best result (70.00%) when we used sentence enrichment with Adjective synonyms, lemmatization 
or snowball stemmer. As illustrated from the results in Table 3, we can see that the obtained results using the 
lexicon-based sentiment analysis technique are less precise than their counterparts. This is mainly due to the fact 
that this approach relies on the accumulative sum of the pre-defined polarity of review tokens. Compared to the 
results of the previous approaches, we note that this approach ignores a word's contextual polarity in a given 
review. Additionally, the latent semantic dimensions of tokens are ignored in this model. 


3.1. Comparison with other SA models 

In this section, we present a comparison between the results that we obtained when using the IMDB 
dataset with respect to similar previous works, namely [26], [31]. The researchers used a group of supervised 
learning algorithms such as the SVM, NB, K-nearest neighbor (KNN) and Maximum Entropy. The results 
obtained by the researchers are shown in Table 4. As shown in this table, the Maximum Entropy classifier 
proved to outperform the rest of the classifiers as it produced the highest accuracy result which is 83.93%. It 
was followed by the SVM classifier with an accuracy of 67.60%. The authors of [26] proposed models to 
classify the IMDB reviews by using six layers in neural network architecture, and they utilized word weights 
to predict sentiment classes. In their work, the authors used two methods to extract the word polarity 
(positive or negative). In particular, they used two manually-constructed lists of pre-defined positive and 
negative words. In addition, they created word ranks by calculating sentiment and measuring domain 
relevance parameters. The researchers obtained a training accuracy of 91.90% and an accuracy of 86.67% for 
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validation. In References [32] Sahu and Ahuja employed SentiWordNet and n-grams feature selection to 
perform feature extraction and ranking. A group of supervised learning algorithms were used where the best 
accuracy achieved was 88.95% when using the RF classifier. In [27], [33], [34], the authors focused on using 
a convolutional neural network (CNN) and long short-term memory (LSTM) to examine sentiment polarity. 
The best sentiment class prediction accuracy that was obtained using this model was 89.50%. 


Table 4. Comparison with existing SA models 


System Employed Classifier Accuracy 
Our Result SVM 90.43% 
Sentiment analysis on IMDB using lexicon and neural networks [26] _ lexicon and neural networks 86.67% 
Sentiment Analysis of IMDB Movie Reviews Using Long Short- Long Short-Term Memory 89.90% 
Term Memory [27] 
Sentiment analysis of movie reviews: A study on feature selection & RF 88.95% 
classification algorithms [32] 
Single and Multibranch CNN-Bidirectional LSTM for IMDB Single and Multi-branch CNN-Bidirectional 89.54% 
Sentiment Analysis [33] LSTM 
Deep CNN-LSTM with combined kernels from multiple branches CNN-LSTM 89.50% 
for IMDB review sentiment analysis [34] 
Sentiment Analysis on IMDB Movie Reviews Using Hybrid Feature © Maximum Entropy 83.93% 


Extraction Method [35] 


4. CONCLUSION 

Current lexicon-based and machine learning based sentiment analysis approaches are still hindered 
by limitations, such as the insufficient knowledge about words’ prior polarities and their contextual polarities 
in sentences, and the manual training required for predicting the sentiment classes of sentences, which is time 
consuming, domain-dependent and error-prone. In this article, we experimentally evaluated four probabilistic 
and machine learning based sentiment classifiers, namely SVM, NB, LR and RFs. We evaluated the quality 
of each of these techniques in predicting the sentiment orientation using three real-world datasets that 
comprised a total of 60,962 sentiment sentences. Our main goal was to study the impact of a variety of NLP 
pipelines on the quality of the predicted sentiment orientations. We also compared that with the utilization of 
lexical semantic knowledge captured by WordNet. Findings demonstrate that there is an impact of varying 
the employed NLP pipelines and the incorporation of semantic relationships on the quality of the sentiment 
analyzers. In particular, we concluded that coupling lemmatization and knowledge-based n-gram features 
proved to produce higher accuracy rates when applied on the IMDB dataset. With this coupling, the accuracy 
of the SVM classifier has improved to be 90.43%. For the three other classifiers (NB, RF and LR), the 
quality of the sentiment classification has also improved to be 86.83%, 90.11%, 86.20%, respectively. It is 
important to highlight the fact that the used pipeline phases are of less complexity in terms of configuring 
their hyper-parameters, as well as possibly computational cost when compared with other deep learning 
models. This is mainly because there are normally several hyper-parameters that require manual 
configuration and fine-tuning one the one hand, and also the existence of multiple layers and neurons in 
conventional neural networks; making it require more time to be trained, configured and also for processing 
the input sentences and assigning labels to sentiment sentences. As such, we conclude that obtaining highly 
accurate sentiment prediction results can still be achieved using a composition of conventional NLP 
processes that are comparable and, in some cases, more superior than complex architectures. Nevertheless, in 
the future work, we will study the impact on varying the composite futures extracted from sentiment 
sentences on the quality of neural network and pre-trained text processing models. For this purpose, we will 
utilize the pipeline phase introduced in this article to evaluate the LSTM, Word2Vec and bidirectional 
encoder representations from transformers (BERT) models. 
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